Distributed representation and estimation of WFST-based n-gram models

نویسندگان

  • Cyril Allauzen
  • Michael Riley
  • Brian Roark
چکیده

We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple blocks or shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be reassembled into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the open-source OpenGrm library.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A method for structure estimation of weighted finite-state transducers and its application to grapheme-to-phoneme conversion

Weighted finite-state transducers (WFSTs) are widely used as a fundamental data structure in several spoken language processing systems since they can provide a unified representation of many types of probabilistic models. Even though the use of accurate WFSTs is important in many spoken language systems, WFSTs are conventionally obtained by transforming probabilistic models that are not estima...

متن کامل

Generalized Fast On-the-fly Com WFST-Based Speech R

This paper describes a Generalized Fast On-the-fly Composition (GFOC) algorithm for Weighted Finite-State Transducers (WFSTs) in speech recognition. We already proposed the original version of GFOC, which yields fast and memory-efficient decoding using two WFSTs. GFOC enables fast on-the-fly composition of three or more WFSTs during decoding. In many cases, it is actually difficult or impossibl...

متن کامل

Evaluations of an Open Source WFST-based Phoneticizer

This paper describes in detail some recent experiments for an Open-Source, WFST-based Grapheme-to-Phoneme system, Phonetisaurus. The system comprises several loosely coupled components and includes implementations of several G2P alignment algorithms, and simple 3-gram LMs, as well as support for several third-party components. Standard G2P evaluations were also performed on widely available tes...

متن کامل

Generalized fast on-the-fly composition algorithm for WFST-based speech recognition

This paper describes a Generalized Fast On-the-fly Composition (GFOC) algorithm for Weighted Finite-State Transducers (WFSTs) in speech recognition. We already proposed the original version of GFOC, which yields fast and memory-efficient decoding using two WFSTs. GFOC enables fast on-the-fly composition of three or more WFSTs during decoding. In many cases, it is actually difficult or impossibl...

متن کامل

Large vocabulary continuous speech recognition using WFST-based linear classifier for structured data

This paper describes a discriminative approach that further advances the framework for Weighted Finite State Transducer (WFST) based decoding. The approach introduces additional linear models for adjusting the scores of a decoding graph composed of conventional information source models (e.g., hidden Markov models and N -gram models), and reviews the WFSTbased decoding process as a linear class...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016